Add SDPA backend tests and refactor generate.py #1477
base: main
Conversation
Push backend manager into caller
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1477
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit aee35a5 with merge base 083fdaf.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add tests for backends
print out parameters during execution
The new attention-backend option has an interesting wrinkle: it only specifies one backend, and if that kernel does not work for the given parameters, things fail. At a minimum, should we always specify "math" as a backup? (See the specific error below.) Also, should we have a backend option "auto" that simply relies on the SDPA dispatch logic to find the best kernel?
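For illustration, here is a minimal sketch of what "one backend plus a math backup, or auto" could look like. This is not torchchat's actual code; `attention_context`, the option strings, and the `allow_math_fallback` flag are made up for this example, and only the `torch.nn.attention.sdpa_kernel` / `SDPBackend` API is real:

```python
import contextlib
from torch.nn.attention import SDPBackend, sdpa_kernel

# Hypothetical mapping from a CLI string to an SDPA backend.
_NAME_TO_BACKEND = {
    "math": SDPBackend.MATH,
    "flash_attention": SDPBackend.FLASH_ATTENTION,
    "efficient_attention": SDPBackend.EFFICIENT_ATTENTION,
}

def attention_context(name: str, allow_math_fallback: bool = True):
    """Context manager restricting SDPA dispatch per the requested option."""
    if name == "auto":
        # No restriction: let SDPA's own dispatch pick the best kernel.
        return contextlib.nullcontext()
    backends = [_NAME_TO_BACKEND[name]]
    if allow_math_fallback and SDPBackend.MATH not in backends:
        # MATH accepts any shape/mask/dtype, so it acts as a safety net.
        backends.append(SDPBackend.MATH)
    return sdpa_kernel(backends)
```

Usage would be along the lines of `with attention_context(args.attention_backend): out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)`.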
Allow math as fallback
In a perfect world, we would have a support matrix of configurations we are confident in (with backing tests), and we would emit a warning right at the start when arguments wander into territory we are not guaranteeing. cc: @yanbing-j
I'm cautious about torchchat's use of fallbacks (and about fallbacks in other PyTorch projects in general, but that's a different can of worms), since it opens the door for misinterpretation: users thinking we are doing what they asked, when in reality it's succeeding with a different config.
Thoughts?
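A rough sketch of that support-matrix-plus-warning idea, assuming the PyTorch `SDPBackend` enum. The table contents below are placeholders, not claims about which kernels are actually tested:

```python
import warnings
from torch.nn.attention import SDPBackend

# Placeholder support matrix: (device_type, backend) pairs with backing tests.
TESTED_CONFIGS = {
    ("cpu", SDPBackend.MATH),
    ("cpu", SDPBackend.FLASH_ATTENTION),
    ("cuda", SDPBackend.MATH),
    ("cuda", SDPBackend.EFFICIENT_ATTENTION),
}

def warn_if_untested(device_type: str, backend: SDPBackend) -> None:
    """Warn at startup when the requested config is outside the matrix."""
    if (device_type, backend) not in TESTED_CONFIGS:
        warnings.warn(
            f"Attention backend {backend} on {device_type} is outside the "
            "tested support matrix; it may fail or behave unexpectedly."
        )
```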
This sounds amazing. Having an explicit list with justifications for the rankings also lets us catch when numerics look fishy.
Confirmed that on the CPU side, only math and flash_attention can be chosen as the attention backend. And so far, flash_attention does not run into the error mentioned above.
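To double-check something like this locally, one could probe each backend with a tiny CPU input and see which ones dispatch. This is an illustrative snippet with arbitrary shapes, not part of the PR:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) -- arbitrary small CPU tensors.
q = k = v = torch.randn(1, 4, 8, 16)

for backend in (SDPBackend.MATH, SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION):
    try:
        with sdpa_kernel(backend):
            F.scaled_dot_product_attention(q, k, v)
        print(f"{backend}: ok")
    except RuntimeError as err:
        print(f"{backend}: unavailable ({err})")
```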
I'm not so enamored of "fail early". "Fail early" turns into "no configuration ever works unless it's perfect", and with overly fragile systems that quickly becomes the empty set.

I agree with the "misleading" configuration issue: a user might feel we are doing what they asked when in reality we are not. OTOH, if we don't posit that things will just work, we push that call onto the user.

The moral equivalent in my mind is a compiler that will only compile programs it has been tested on, because everything else can't be expected to work. That's not useful for the end user, and it transfers a lot of responsibility to them -- responsibility they don't have enough data to make a meaningful call with. Imagine if every compiler returned "Unknown program. Compiled code not guaranteed to be correct." and thereby transferred the responsibility to the compiler user.
Tests from pytorch#1477 only, without the generate.py refactor
All good points!! Convinced that refusing to run is a bad experience.

For the sake of avoiding indirection, I suggest we execute the config as requested, allowing any errors or bad perf to surface to the caller, potentially failing at execution time. See the sketch below.

For the sake of testing, we can omit or check for the failure cases.
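In code, the "execute as requested, no silent fallback" version would simply pass the single requested backend through and let any RuntimeError propagate. Again a sketch with illustrative names, not the PR's implementation:

```python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel

def attention_step(q, k, v, mask, backend):
    # Restrict SDPA to exactly the requested backend -- no MATH appended.
    with sdpa_kernel(backend):
        # If this kernel cannot handle the mask/shape/dtype, the error
        # surfaces to the caller at execution time, as discussed above.
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```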
#1480 runs a matrix of all tests against the attention backend options. In a nutshell, only MATH is guaranteed to handle all inputs, so if we exclude MATH we'll naturally get scenarios where we can't run correctly. TBH, given the call site is passing a mask that the flash attention kernel does not support, I don't think we can ever get it to use flash attention.
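The gist of such a matrix in pytest form (illustrative only, not the actual #1480 test file): MATH is expected to handle every input, while kernel-specific backends may legitimately refuse an explicit mask.

```python
import pytest
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

BACKENDS = [SDPBackend.MATH, SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]

@pytest.mark.parametrize("backend", BACKENDS)
def test_sdpa_with_explicit_mask(backend):
    q = k = v = torch.randn(1, 4, 8, 16)
    # An explicit boolean mask (rather than is_causal=True) rules out flash attention.
    mask = torch.ones(8, 8, dtype=torch.bool).tril()
    try:
        with sdpa_kernel(backend):
            out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    except RuntimeError:
        if backend == SDPBackend.MATH:
            raise  # MATH should never fail
        pytest.skip(f"{backend} does not support this input on this device")
    assert out.shape == q.shape
```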